Creating the Tables in R

We started the project by loading the data from the seven CSV files that had been uploaded to MySQL.

pacman::p_load(RMySQL)

mysqlconnection = dbConnect(
  RMySQL::MySQL(),
  dbname='gedata',
  host='mysql.fsb.miamioh.edu',
  port=3306,
  user='fsbstud',
  password='fsb4you')


dbListTables(mysqlconnection)
## [1] "engine_data_aic"          "engine_data_axm"         
## [3] "engine_data_fron"         "engine_data_pgt"         
## [5] "esn_rul"                  "lkp_airport_codes_t"     
## [7] "manufacturing_sql_by_esn"

Next, we created data frames in R from the tables hosted in MySQL, reading each table into its own data frame.

engine_data_aic= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_aic")
engine_data_axm= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_axm")
engine_data_fron= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_fron")
engine_data_pgt= RMySQL::dbReadTable(mysqlconnection, name= "engine_data_pgt")
esn_rul= RMySQL::dbReadTable(mysqlconnection, name= "esn_rul")
lkp_airport_codes_t= RMySQL::dbReadTable(mysqlconnection, name= "lkp_airport_codes_t")
manufacturing_sql_by_esn= RMySQL::dbReadTable(mysqlconnection, name= "manufacturing_sql_by_esn")

Then, we stacked the four operator data frames into a single engine_data data frame using rbind().

engine_data= rbind(engine_data_aic, engine_data_axm, engine_data_fron, engine_data_pgt)
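rbind() requires the four data frames to share identical column layouts. A common alternative (a sketch, not the code we ran) is dplyr::bind_rows(), which matches columns by name and can tag each row with the table it came from via the .id argument:

```r
pacman::p_load(dplyr)
# Stack the operator tables and record the source table in a new column
engine_data = bind_rows(
  aic  = engine_data_aic,
  axm  = engine_data_axm,
  fron = engine_data_fron,
  pgt  = engine_data_pgt,
  .id  = "source_table")
```

Here the source tag is redundant with the existing operator column, but the by-name matching makes the stack robust if the column order ever differs between tables.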

Joining the esn_rul table

Now, we joined the esn_rul table to the table with all of the engine data using an inner join.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
esn_joined_data= inner_join(engine_data, esn_rul, na_matches= 'na') # gives only training data
## Joining, by = "esn"

Joining the manufacturing_sql_by_esn table

We then joined the manufacturing data table to the combined table using an inner join.

join_data= inner_join(esn_joined_data, manufacturing_sql_by_esn, na_matches= 'na') 
## Joining, by = "esn"
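Both joins above let dplyr detect the shared key automatically (it reported Joining, by = "esn"). Making the key explicit, as sketched below, guards against accidental matches if the tables ever gain other shared column names:

```r
# Same joins with the key spelled out
esn_joined_data = inner_join(engine_data, esn_rul,
                             by = "esn", na_matches = "na")
join_data = inner_join(esn_joined_data, manufacturing_sql_by_esn,
                       by = "esn", na_matches = "na")
```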

Imputing Nulls

The reasoning for this step is discussed under the “Consistent Data” header; it needs to be performed here for the dataset to be created properly.

# Replace empty airport codes: a blank depart_icao takes the previous
# record's destination_icao, and a blank destination_icao then takes the
# (now filled) depart_icao of the same record.
for (i in seq_along(join_data$depart_icao)) {
  if (join_data$depart_icao[i] == "")
    join_data$depart_icao[i] <- join_data$destination_icao[i - 1]
  if (join_data$destination_icao[i] == "")
    join_data$destination_icao[i] <- join_data$depart_icao[i]
}
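The same imputation can be expressed without an explicit loop. The sketch below (an alternative, not the code we ran) uses dplyr::lag() to pull the previous record's destination; like the loop, it assumes the first record has a non-empty depart_icao:

```r
pacman::p_load(dplyr)
# Within one mutate(), the second expression sees the updated depart_icao
join_data = join_data %>%
  mutate(
    depart_icao      = if_else(depart_icao == "",
                               lag(destination_icao), depart_icao),
    destination_icao = if_else(destination_icao == "",
                               depart_icao, destination_icao))
```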

Joining the lkp_airport_codes_t table

Finally, we joined the airport codes table to our combined table using a pair of left joins. We joined this table on both the depart_icao and the destination_icao columns, since we need the coordinates of both airports to calculate the distance between them.

airport_depart= left_join(join_data, lkp_airport_codes_t, by= c("depart_icao"= "airport_icao"))
airport_dest= left_join(airport_depart, lkp_airport_codes_t, by= c("destination_icao"= "airport_icao"))
airport_final= airport_dest

Calculating the Distance between Airports

Next, we used the distHaversine() function from the geosphere package to calculate the distance between airports. distHaversine() returns distances in meters, so we converted the result to miles.

pacman::p_load(swfscMisc)
pacman::p_load(geosphere)
airport_final$distance= distHaversine(
  p1= cbind(airport_final$longitude.x, airport_final$latitude.x),
  p2= cbind(airport_final$longitude.y, airport_final$latitude.y))
airport_final$distance= airport_final$distance / 1000   # meters -> km
airport_final$distance= convert.distance(airport_final$distance, from = "km", to = "mi")
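As a sanity check on the unit conversion, one leg in the data (LFBO to VIDP, using the coordinates from the lookup table) can be recomputed by hand; distHaversine() returns meters, and dividing by 1609.344 (meters per mile) should agree with the km-to-mi conversion above:

```r
pacman::p_load(geosphere)
# Toulouse (LFBO) to Delhi (VIDP); points are c(longitude, latitude)
d_m = distHaversine(p1 = c(1.368, 43.635), p2 = c(77.112, 28.569))
d_m / 1609.344   # about 4212 miles, matching the distance column
```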

Cleaning the Data

Next, we ensured that the data was tidy, technically correct, and consistent.

Tidy Data

We started by checking for tidiness. Through our observation of the data, we concluded that the datetime column needed to be split into date and time in order for the data to be tidy.

pacman::p_load(lubridate, tidyr, hms)
airport_final$datetime= as_datetime(airport_final$datetime)
Tech_Correct = separate(airport_final, datetime, c("date", "time"), sep = ' ')

Technically Correct

Now that we had ensured data tidiness, we made sure the data was technically correct. We cleaned the column names with clean_names() and used glimpse() to check the data types. We found that the date and time columns needed to be converted to date and time types, and we renamed the latitude and longitude variables to make them easier to read and understand.

pacman::p_load(janitor)  # p_load() also attaches the package
Tech_Correct = clean_names(Tech_Correct)
glimpse(Tech_Correct)
## Rows: 13,096
## Columns: 45
## $ dataset                      <chr> "test_FD001", "test_FD001", "test_FD001",…
## $ esn                          <int> 999120, 999120, 999120, 999120, 999120, 9…
## $ unit                         <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
## $ flight_cycle                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ date                         <chr> "2018-02-01", "2018-02-01", "2018-02-08",…
## $ time                         <chr> "13:47:42", "18:22:24", "00:46:00", "03:5…
## $ operator                     <chr> "AIC", "AIC", "AIC", "AIC", "AIC", "AIC",…
## $ depart_icao                  <chr> "LFBO", "LFBO", "VIDP", "VABB", "VIDP", "…
## $ destination_icao             <chr> "LFBO", "VIDP", "VABB", "VIDP", "VABB", "…
## $ hpc_eff_mod                  <dbl> 0.0011, -0.0001, -0.0015, -0.0022, -0.006…
## $ hpc_flow_mod                 <dbl> -5e-04, -3e-04, 3e-04, 1e-04, -1e-04, 5e-…
## $ tra                          <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ t2                           <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 5…
## $ t24                          <dbl> 182.59, 182.56, 182.54, 183.01, 182.73, 1…
## $ t30                          <dbl> 1588.38, 1590.73, 1590.70, 1588.91, 1588.…
## $ t50                          <dbl> 1397.780, 1405.880, 1401.310, 1403.280, 1…
## $ p2                           <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62,…
## $ p15                          <dbl> 21.61, 21.61, 21.60, 21.61, 21.61, 21.61,…
## $ p30                          <dbl> 554.43, 553.62, 554.24, 553.79, 553.90, 5…
## $ nf                           <dbl> 2388.07, 2388.06, 2388.08, 2388.06, 2388.…
## $ nc                           <dbl> 9072.24, 9059.50, 9064.83, 9058.04, 9064.…
## $ epr                          <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1…
## $ ps30                         <dbl> 47.40, 47.32, 47.19, 47.28, 47.42, 47.21,…
## $ phi                          <dbl> 521.88, 522.03, 521.80, 522.25, 522.17, 5…
## $ nrf                          <dbl> 2388.06, 2388.06, 2388.03, 2388.06, 2388.…
## $ nrc                          <dbl> 8147.58, 8151.74, 8146.37, 8144.65, 8146.…
## $ bpr                          <dbl> 8.3923, 8.4385, 8.4234, 8.3955, 8.4371, 8…
## $ farb                         <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
## $ htbleed                      <int> 393, 392, 392, 391, 393, 394, 391, 392, 3…
## $ nf_dmd                       <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388,…
## $ pcnfr_dmd                    <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ w31                          <dbl> 38.76, 38.90, 38.96, 38.94, 38.94, 39.11,…
## $ w32                          <dbl> 23.4970, 23.4240, 23.4460, 23.3412, 23.44…
## $ rul                          <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ x44321p02_op016_median_first <dbl> 23.25222, 23.25222, 23.25222, 23.25222, 2…
## $ x44321p02_op420_median_first <dbl> 14.54578, 14.54578, 14.54578, 14.54578, 1…
## $ x54321p01_op116_median_first <dbl> 22.15643, 22.15643, 22.15643, 22.15643, 2…
## $ x54321p01_op220_median_first <dbl> 29.89512, 29.89512, 29.89512, 29.89512, 2…
## $ x65421p11_op232_median_first <dbl> 188.632, 188.632, 188.632, 188.632, 188.6…
## $ x65421p11_op630_median_first <dbl> 231.0214, 231.0214, 231.0214, 231.0214, 2…
## $ latitude_x                   <dbl> 43.635, 43.635, 28.569, 19.092, 28.569, 1…
## $ longitude_x                  <dbl> 1.368, 1.368, 77.112, 72.866, 77.112, 72.…
## $ latitude_y                   <dbl> 43.635, 28.569, 19.092, 28.569, 19.092, 1…
## $ longitude_y                  <dbl> 1.368, 77.112, 72.866, 77.112, 72.866, 76…
## $ distance                     <dbl> 0.0000, 4211.6346, 708.2600, 708.2600, 70…
# Rename by name rather than by column position
Tech_Correct = dplyr::rename(Tech_Correct,
  depart_latitude  = latitude_x,
  depart_longitude = longitude_x,
  dest_latitude    = latitude_y,
  dest_longitude   = longitude_y)

Tech_Correct$date = as.Date(Tech_Correct$date)
Tech_Correct$time= as_hms(Tech_Correct$time)
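A quick way to confirm the conversions took effect is to inspect the column classes directly (columns created with as_hms() inherit from difftime):

```r
class(Tech_Correct$date)  # "Date"
class(Tech_Correct$time)  # "hms" "difftime"
```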

Consistent Data

Finally, we checked our data for consistency. We found that some records were missing their departure and arrival airports. Upon further investigation, it appears that these planes were inspected but did not leave the airport, or possibly were delayed or aborted their takeoff. After ordering the data by esn and flight_cycle, we noticed that for records with empty strings in depart_icao and destination_icao, the destination_icao of the preceding record and the depart_icao of the following record were the same. We concluded that these planes did not in fact leave the airport, but were still inspected. We used a for loop to impute the airport information from the flight directly before, which removes the empty values and lets R fill in the latitude, longitude, and distance (0 for every record we imputed). Because these inspection readings may still be relevant for predicting the remaining useful life of the engines, we opted to keep these records in the dataset.

The code we used to do this is under the “Imputing Nulls” header; it had to run at that step for the missing values to be imputed correctly.

pacman::p_load(DataExplorer)
plot_missing(Tech_Correct)

Data Validation Table

Next, we utilized the pointblank package to create a data validation table for our cleaned data set. In this table, we checked that each column was of its expected type, and that several columns contained only values greater than or equal to 0.

pacman::p_load(pointblank)
act= action_levels(warn_at= 0.01, notify_at= 0.01)
agent= create_agent(tbl= Tech_Correct, actions= act)
agent %>%
  col_is_date(columns= 'date') %>% 
  col_is_posix(columns= 'time') %>% 
  col_is_integer(columns= vars(esn, unit, flight_cycle, tra, htbleed, nf_dmd, pcnfr_dmd, rul)) %>% 
  col_is_numeric(columns= vars(hpc_eff_mod, hpc_flow_mod, t2, t24, t30, t50, p2, p15, p30, nf, nc, epr, ps30, phi, nrf, nrc, bpr)) %>% 
  col_is_numeric(columns= vars(farb, w31, w32, depart_latitude, depart_longitude, dest_latitude, dest_longitude, distance)) %>%
  col_is_numeric(columns= vars(x65421p11_op630_median_first, x65421p11_op232_median_first,x54321p01_op220_median_first)) %>%
  col_is_numeric(columns= vars(x54321p01_op116_median_first, x44321p02_op420_median_first, x44321p02_op016_median_first)) %>% 
  col_is_character(columns= vars(dataset, operator, depart_icao, destination_icao)) %>%
  col_vals_gte(columns= vars(unit, flight_cycle, rul, distance), value= 0) -> 
  agent 
results= interrogate(agent)
results %>% export_report(filename= 'ge_project_evaluation.html')
## ✔ The agent has been written as `ge_project_evaluation.html`
results
Pointblank Validation [2022-11-12 17:56:44] — data frame: Tech_Correct; WARN 0.01; NOTIFY 0.01

The interrogation ran 49 validation steps: col_is_date() on date; col_is_posix() on time; col_is_integer() on the 8 integer columns (esn, unit, flight_cycle, tra, htbleed, nf_dmd, pcnfr_dmd, rul); col_is_numeric() on the 31 numeric columns; col_is_character() on dataset, operator, depart_icao, and destination_icao; and col_vals_gte(value = 0) on unit, flight_cycle, rul, and distance (13,096 units each). All type and range checks passed except step 2: col_is_posix() failed on time, most likely because as_hms() produces an hms/difftime column rather than a POSIXct one.
glimpse(Tech_Correct)
## Rows: 13,096
## Columns: 45
## $ dataset                      <chr> "test_FD001", "test_FD001", "test_FD001",…
## $ esn                          <int> 999120, 999120, 999120, 999120, 999120, 9…
## $ unit                         <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 2…
## $ flight_cycle                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13…
## $ date                         <date> 2018-02-01, 2018-02-01, 2018-02-08, 2018…
## $ time                         <time> 13:47:42, 18:22:24, 00:46:00, 03:58:00, …
## $ operator                     <chr> "AIC", "AIC", "AIC", "AIC", "AIC", "AIC",…
## $ depart_icao                  <chr> "LFBO", "LFBO", "VIDP", "VABB", "VIDP", "…
## $ destination_icao             <chr> "LFBO", "VIDP", "VABB", "VIDP", "VABB", "…
## $ hpc_eff_mod                  <dbl> 0.0011, -0.0001, -0.0015, -0.0022, -0.006…
## $ hpc_flow_mod                 <dbl> -5e-04, -3e-04, 3e-04, 1e-04, -1e-04, 5e-…
## $ tra                          <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ t2                           <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 5…
## $ t24                          <dbl> 182.59, 182.56, 182.54, 183.01, 182.73, 1…
## $ t30                          <dbl> 1588.38, 1590.73, 1590.70, 1588.91, 1588.…
## $ t50                          <dbl> 1397.780, 1405.880, 1401.310, 1403.280, 1…
## $ p2                           <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62,…
## $ p15                          <dbl> 21.61, 21.61, 21.60, 21.61, 21.61, 21.61,…
## $ p30                          <dbl> 554.43, 553.62, 554.24, 553.79, 553.90, 5…
## $ nf                           <dbl> 2388.07, 2388.06, 2388.08, 2388.06, 2388.…
## $ nc                           <dbl> 9072.24, 9059.50, 9064.83, 9058.04, 9064.…
## $ epr                          <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1…
## $ ps30                         <dbl> 47.40, 47.32, 47.19, 47.28, 47.42, 47.21,…
## $ phi                          <dbl> 521.88, 522.03, 521.80, 522.25, 522.17, 5…
## $ nrf                          <dbl> 2388.06, 2388.06, 2388.03, 2388.06, 2388.…
## $ nrc                          <dbl> 8147.58, 8151.74, 8146.37, 8144.65, 8146.…
## $ bpr                          <dbl> 8.3923, 8.4385, 8.4234, 8.3955, 8.4371, 8…
## $ farb                         <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,…
## $ htbleed                      <int> 393, 392, 392, 391, 393, 394, 391, 392, 3…
## $ nf_dmd                       <int> 2388, 2388, 2388, 2388, 2388, 2388, 2388,…
## $ pcnfr_dmd                    <int> 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ w31                          <dbl> 38.76, 38.90, 38.96, 38.94, 38.94, 39.11,…
## $ w32                          <dbl> 23.4970, 23.4240, 23.4460, 23.3412, 23.44…
## $ rul                          <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 1…
## $ x44321p02_op016_median_first <dbl> 23.25222, 23.25222, 23.25222, 23.25222, 2…
## $ x44321p02_op420_median_first <dbl> 14.54578, 14.54578, 14.54578, 14.54578, 1…
## $ x54321p01_op116_median_first <dbl> 22.15643, 22.15643, 22.15643, 22.15643, 2…
## $ x54321p01_op220_median_first <dbl> 29.89512, 29.89512, 29.89512, 29.89512, 2…
## $ x65421p11_op232_median_first <dbl> 188.632, 188.632, 188.632, 188.632, 188.6…
## $ x65421p11_op630_median_first <dbl> 231.0214, 231.0214, 231.0214, 231.0214, 2…
## $ depart_latitude              <dbl> 43.635, 43.635, 28.569, 19.092, 28.569, 1…
## $ depart_longitude             <dbl> 1.368, 1.368, 77.112, 72.866, 77.112, 72.…
## $ dest_latitude                <dbl> 43.635, 28.569, 19.092, 28.569, 19.092, 1…
## $ dest_longitude               <dbl> 1.368, 77.112, 72.866, 77.112, 72.866, 76…
## $ distance                     <dbl> 0.0000, 4211.6346, 708.2600, 708.2600, 70…
write.csv(Tech_Correct, file = "Cleaned_Data.csv", row.names = FALSE)